Abstract

In this paper we propose an algorithm to classify tensor data. Our methodology builds on
recent studies of matrix classification with a trace-norm-constrained weight matrix and on
the tensor trace norm. As in matrix classification, tensor classification is formulated as a
convex optimization problem that can be solved with the off-the-shelf accelerated proximal
gradient (APG) method. However, unlike the matrix case, the proximal gradient update of the
weight tensor has no analytic solution. To tackle this problem, the Douglas-Rachford splitting
technique and the alternating direction method of multipliers (ADM) used in tensor completion
are adapted to update the weight tensor. Furthermore, motivated by the demands of real
applications, we also propose online learning variants. Experiments demonstrate the efficiency
of the methods.


1 Introduction 

Tensor, or multi-way, data analysis has many applications in psychometrics, econometrics,
image processing, signal processing, neuroscience, and data mining [1]. Tensors are the
higher-order equivalents of vectors and matrices. In this paper, we consider the classification
of tensors, which generalizes the matrix classification problem proposed by Tomioka and
Aihara [2]. The tensor classification model is formulated as:



$$f(\mathcal{X}; \mathcal{W}, b) = \langle \mathcal{W}, \mathcal{X} \rangle + b \qquad (1)$$



where $\mathcal{W}, \mathcal{X} \in \mathbb{R}^{I_1 \times I_2 \times \cdots \times I_N}$ are $N$-way tensors, $\mathcal{X}$ is the input
tensor whose class label $y$ we would like to predict, $\mathcal{W}$ is called the weight tensor, and
$b \in \mathbb{R}$ is the bias. We therefore need to infer the weight tensor and the bias from the training
samples $\{\mathcal{X}_i, y_i\}_{i=1}^{s}$. Under this formulation, the work in [2] is the special case in which the
tensors involved have order $N = 2$.

In their work on matrix classification, Tomioka and Aihara use a regularization scheme
based on the trace norm of the weight matrix [2]. Trace norm regularization has recently
been studied in various contexts, namely multi-task learning [3], matrix completion [4,5],
and robust principal component analysis [6]. In this paper, as in matrix classification, a
trace norm for tensors is introduced to jointly control the complexity of the weight tensor and
the deviation of the empirical statistics from the predictions. Recently, Liu et al. [7]
proposed a definition of the tensor trace norm:

$$\|\mathcal{X}\|_* := \frac{1}{N} \sum_{i=1}^{N} \|\mathcal{X}_{(i)}\|_* \qquad (2)$$

where $\mathcal{X}_{(i)}$ is the mode-$i$ unfolding of $\mathcal{X}$ and $\|\mathcal{X}_{(i)}\|_*$ is the trace norm of the matrix
$\mathcal{X}_{(i)}$, i.e. the sum of its singular values. If $N = 2$, this tensor norm is just the ordinary matrix
trace norm. The weight tensor and bias learning problem now becomes the convex optimization
problem

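$$\min F_s(\mathcal{W}, b) = f_s(\mathcal{W}, b) + \lambda \|\mathcal{W}\|_*, \qquad (3)$$

where $f_s(\mathcal{W}, b) = \sum_{i=1}^{s} \ell(y_i, \langle \mathcal{W}, \mathcal{X}_i \rangle + b)$ is the empirical cost function induced by some
convex smooth loss function $\ell(\cdot, \cdot)$, and $\lambda$ is the regularization parameter. The subscript of
$f_s(\mathcal{W}, b)$ indicates either the number of training samples or the time step of the training
procedure, whichever is apparent from context.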
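As an illustration of Eqs. (1) and (3), the following NumPy sketch evaluates the tensor trace norm and the regularized objective. It assumes the squared loss that is adopted later in the paper; the function names `tensor_trace_norm` and `objective`, and the convention that the training tensors are stacked along the first axis of `X`, are choices made for this example only.

```python
import numpy as np

def tensor_trace_norm(W):
    """Average of the nuclear norms of all mode-n unfoldings of W."""
    total = 0.0
    for n in range(W.ndim):
        # Column order does not change singular values, so a plain reshape is enough here.
        unfolding = np.moveaxis(W, n, 0).reshape(W.shape[n], -1)
        total += np.linalg.svd(unfolding, compute_uv=False).sum()
    return total / W.ndim

def objective(W, b, X, y, lam):
    """F_s(W, b) with the squared loss; X has shape (s, I_1, ..., I_N)."""
    sample_axes = list(range(1, X.ndim))
    preds = np.tensordot(X, W, axes=(sample_axes, list(range(W.ndim)))) + b
    return float(np.sum((y - preds) ** 2) + lam * tensor_trace_norm(W))
```

For a matrix (N = 2) `tensor_trace_norm` reduces to the ordinary nuclear norm, which is a quick way to sanity-check the definition.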

For such convex optimization problems, Toh and Yun [8], Ji and Ye [9], and Liu et al. [10]
independently proposed similar algorithms for matrix-related problems based on the
accelerated proximal gradient (APG) method. In this paper, we adapt the APG-based
algorithm to the tensor convex optimization problem above. Unfortunately, unlike Theorem 3.1 in
[9] for the matrix case, there is no closed-form solution for the weight updating rule in the APG
algorithm in the tensor case, because the unfoldings along the different modes depend on one
another. To solve the weight updating problem, we employ the Douglas-Rachford splitting technique
and the alternating direction method of multipliers [15,16], which have been used successfully in
tensor completion tasks [7,11].

Furthermore, to cope with situations in which the training set is too large to be loaded
into memory at once, or in which the training data arrive sequentially (for example in video
processing), we propose online implementations of the above algorithms.



2 Notations 

We adopt the nomenclature used by Kolda and Bader in their review of tensor decompositions and
applications [1]. The order $N$ of a tensor is the number of dimensions, also known as ways or
modes. Matrices (tensors of order two) are denoted by upper case letters, e.g. $X$, and their
elements by lower case letters, e.g. $x_{ij}$. Higher-order tensors (order three or higher) are
denoted by Euler script letters, e.g. $\mathcal{X}$, and element $(i_1, i_2, \ldots, i_N)$ of an $N$-order tensor $\mathcal{X}$ is
denoted by $x_{i_1 i_2 \ldots i_N}$. Fibers are the higher-order analogue of matrix rows and columns. A fiber
is defined by fixing every index but one. The mode-$n$ fibers are all vectors
$x_{i_1 \ldots i_{n-1} : i_{n+1} \ldots i_N}$ obtained by fixing the values of $\{i_1, i_2, \ldots, i_N\} \setminus i_n$. The mode-$n$ unfolding,
also known as matricization, of a tensor $\mathcal{X} \in \mathbb{R}^{I_1 \times I_2 \times \cdots \times I_N}$ is denoted by $\mathcal{X}_{(n)}$ and arranges the
mode-$n$ fibers to be the columns of the resulting matrix. The unfolding operator is denoted by
unfold(·). The opposite operation, refold(·), refolds the matrix into a tensor. The tensor
element $(i_1, i_2, \ldots, i_N)$ is mapped to the matrix element $(i_n, j)$, where

$$j = 1 + \sum_{k=1,\, k \neq n}^{N} (i_k - 1) J_k \quad \text{with} \quad J_k = \prod_{m=1,\, m \neq n}^{k-1} I_m.$$

Therefore, $\mathcal{X}_{(n)} \in \mathbb{R}^{I_n \times I_1 \cdots I_{n-1} I_{n+1} \cdots I_N}$. The $n$-rank of an $N$-dimensional tensor $\mathcal{X}$, denoted by
$\mathrm{rank}_n(\mathcal{X})$, is the column rank of $\mathcal{X}_{(n)}$, i.e. the dimension of the vector space spanned by the
mode-$n$ fibers. The inner product of two same-size tensors $\mathcal{X}, \mathcal{Y} \in \mathbb{R}^{I_1 \times I_2 \times \cdots \times I_N}$ is defined as

$$\langle \mathcal{X}, \mathcal{Y} \rangle = \sum_{i_1=1}^{I_1} \sum_{i_2=1}^{I_2} \cdots \sum_{i_N=1}^{I_N} x_{i_1 i_2 \ldots i_N}\, y_{i_1 i_2 \ldots i_N}.$$

The corresponding norm is $\|\mathcal{X}\|_F = \sqrt{\langle \mathcal{X}, \mathcal{X} \rangle}$, which is often called the Frobenius norm.
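The notation of this section can be made concrete with a short NumPy sketch. It is one possible implementation, not code from the paper: `unfold`, `refold`, `column_index` and `inner` are names chosen here, mode indices are 0-based in the code, and Fortran-order reshaping is used so that the column ordering follows the index mapping for $j$ given above.

```python
import numpy as np

def unfold(X, n):
    """Mode-n unfolding (n is 0-based): mode-n fibers become the columns."""
    return np.reshape(np.moveaxis(X, n, 0), (X.shape[n], -1), order="F")

def refold(M, n, shape):
    """Inverse of unfold: rebuild a tensor with the given shape."""
    rest = [d for k, d in enumerate(shape) if k != n]
    return np.moveaxis(np.reshape(M, [shape[n]] + rest, order="F"), 0, n)

def column_index(idx, n, shape):
    """1-based column j for a 1-based index tuple idx, following the mapping above."""
    j, J = 1, 1
    for k in range(len(shape)):
        if k == n:
            continue          # the mode-n index selects the row, not the column
        j += (idx[k] - 1) * J
        J *= shape[k]
    return j

def inner(X, Y):
    """Tensor inner product: sum of elementwise products."""
    return float(np.sum(X * Y))
```

For example, with a tensor of shape (3, 4, 2) the element with 1-based index (2, 3, 1) lands in row 2, column `column_index((2, 3, 1), 0, (3, 4, 2))` of the first unfolding.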

3 Accelerated Proximal Gradient Method 

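It is known [8] that the gradient step

$$\mathcal{W}_k = \mathcal{W}_{k-1} - \frac{1}{t_k} \nabla_{\mathcal{W}} f_s(\mathcal{W}_{k-1}, b) \qquad (4)$$

for solving the following smooth problem with fixed bias $b$

$$\min_{\mathcal{W}} f_s(\mathcal{W}, b) \qquad (5)$$

without trace norm regularization can be formulated equivalently as a proximal regularization
of the linearized function $f_s(\mathcal{W}, b)$ at $\mathcal{W}_{k-1}$:

$$\mathcal{W}_k = \operatorname{argmin}_{\mathcal{W}} P_{t_k}(\mathcal{W}, \mathcal{W}_{k-1}), \qquad (6)$$

where

$$P_{t_k}(\mathcal{W}, \mathcal{W}_{k-1}) = f_s(\mathcal{W}_{k-1}, b) + \langle \mathcal{W} - \mathcal{W}_{k-1}, \nabla_{\mathcal{W}} f_s(\mathcal{W}_{k-1}, b) \rangle + \frac{t_k}{2} \|\mathcal{W} - \mathcal{W}_{k-1}\|_F^2 \qquad (7)$$

and $\nabla_{\mathcal{W}} f_s(\cdot, b)$ is the gradient of $f_s(\cdot, b)$ with respect to $\mathcal{W}$.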

Based on this equivalence relation, Toh and Yun [8], Ji and Ye [9], and Liu et al. [10]
proposed to solve the optimization problem in Eq. (3) by the following iterative step:

$$\mathcal{W}_k = \operatorname{argmin}_{\mathcal{W}} Q_{t_k}(\mathcal{W}, \mathcal{W}_{k-1}) \triangleq P_{t_k}(\mathcal{W}, \mathcal{W}_{k-1}) + \lambda \|\mathcal{W}\|_* \qquad (8)$$

or equivalently

$$\mathcal{W}_k = \operatorname{argmin}_{\mathcal{W}} \Big\{ \frac{t_k}{2} \Big\|\mathcal{W} - \Big(\mathcal{W}_{k-1} - \frac{1}{t_k} \nabla_{\mathcal{W}} f_s(\mathcal{W}_{k-1}, b)\Big)\Big\|_F^2 + \lambda \|\mathcal{W}\|_* \Big\}. \qquad (9)$$

Unfortunately, when the order of the tensors involved in the problem is three or higher, there
is no closed-form solution to the above problem because of the tensor trace norm. This is in
contrast to the matrix case, where Eq. (9) can be solved by singular value decomposition (SVD)
and soft "shrinkage", as in Theorem 3.1 in [9]. However, the Douglas-Rachford splitting technique
and the alternating direction method of multipliers can be used to solve Eq. (9) for higher-order
tensors. These methods are described in the next section. For now, we assume that Eq. (8) or
Eq. (9) can be solved properly.

In general APG methods, the Lipschitz constant of $\nabla_{\mathcal{W}} f_s(\cdot, b)$ is unknown, so an appropriate
step size $t_k$ has to be estimated to guarantee the convergence rate [8,9,10]. In this work,
the standard squared loss function is used in Eq. (3). With this loss function, we can compute
the Lipschitz constant explicitly, as shown in Lemma 3.1. Thus the step size estimation can be
omitted in our tensor classification problems.

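Lemma 3.1. $\nabla_{\mathcal{W}} f_s(\cdot, b)$ is Lipschitz continuous with constant $L = 2 \prod_{m=1}^{N} I_m \sum_{i=1}^{s} \|\mathcal{X}_i\|_F^2$, i.e.,

$$\|\nabla_{\mathcal{W}} f_s(\mathcal{U}, b) - \nabla_{\mathcal{W}} f_s(\mathcal{V}, b)\|_F \le L \|\mathcal{U} - \mathcal{V}\|_F \quad \forall\, \mathcal{U}, \mathcal{V} \in \mathbb{R}^{I_1 \times I_2 \times \cdots \times I_N}, \qquad (10)$$

where $\|\cdot\|_F$ denotes the Frobenius norm.

Proof. With the standard squared loss, the gradient of $f_s(\mathcal{W}, b)$ with respect to $\mathcal{W}$ is

$$\nabla_{\mathcal{W}} f_s(\mathcal{W}, b) = -2 \sum_{i=1}^{s} (y_i - \langle \mathcal{W}, \mathcal{X}_i \rangle - b)\, \mathcal{X}_i. \qquad (11)$$

Applying Eq. (11) with $\mathcal{U}$ and $\mathcal{V}$ to the left-hand side of Eq. (10), we obtain

$$\begin{aligned}
\|\nabla_{\mathcal{W}} f_s(\mathcal{U}, b) - \nabla_{\mathcal{W}} f_s(\mathcal{V}, b)\|_F
&= \Big\| -2 \sum_{i=1}^{s} (y_i - \langle \mathcal{U}, \mathcal{X}_i \rangle - b)\, \mathcal{X}_i + 2 \sum_{i=1}^{s} (y_i - \langle \mathcal{V}, \mathcal{X}_i \rangle - b)\, \mathcal{X}_i \Big\|_F \\
&= 2 \Big\| \sum_{i=1}^{s} (\langle \mathcal{U}, \mathcal{X}_i \rangle - \langle \mathcal{V}, \mathcal{X}_i \rangle)\, \mathcal{X}_i \Big\|_F \\
&\le 2 \sum_{i=1}^{s} |\langle \mathcal{U} - \mathcal{V}, \mathcal{X}_i \rangle|\, \|\mathcal{X}_i\|_F \\
&\le 2 \prod_{m=1}^{N} I_m \sum_{i=1}^{s} \|\mathcal{X}_i\|_F^2\, \|\mathcal{U} - \mathcal{V}\|_F,
\end{aligned}$$

where in the last inequality we use the easily verified fact that
$|\langle \mathcal{A}, \mathcal{B} \rangle| \le \|\mathcal{A}\|_{\ell_1} \|\mathcal{B}\|_{\ell_1} \le \prod_{m=1}^{N} I_m \|\mathcal{A}\|_F \|\mathcal{B}\|_F$ for all $\mathcal{A}, \mathcal{B} \in \mathbb{R}^{I_1 \times I_2 \times \cdots \times I_N}$. Here $\|\cdot\|_{\ell_1}$ denotes
the $\ell_1$ norm, which is the sum of the absolute values of the tensor elements.

Thus the lemma is proved, that is to say, $\nabla_{\mathcal{W}} f_s(\cdot, b)$ is Lipschitz continuous with constant
$L = 2 \prod_{m=1}^{N} I_m \sum_{i=1}^{s} \|\mathcal{X}_i\|_F^2$.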
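The gradient used in the proof and the constant of Lemma 3.1 are easy to compute directly. The sketch below is illustrative only: `grad_w` and `lipschitz_constant` are hypothetical names, and the training tensors are assumed to be stacked along the first axis of `X`.

```python
import numpy as np

def grad_w(W, b, X, y):
    """Gradient of the squared-loss f_s with respect to W; X has shape (s, I_1, ..., I_N)."""
    sample_axes = list(range(1, X.ndim))
    residual = y - np.tensordot(X, W, axes=(sample_axes, list(range(W.ndim)))) - b
    # -2 * sum_i residual_i * X_i
    return -2.0 * np.tensordot(residual, X, axes=(0, 0))

def lipschitz_constant(X):
    """L = 2 * prod_m I_m * sum_i ||X_i||_F^2, as stated in Lemma 3.1."""
    return 2.0 * np.prod(X.shape[1:]) * np.sum(X ** 2)

# Numerical check of the Lipschitz inequality on random data.
rng = np.random.default_rng(0)
X = rng.standard_normal((5, 3, 4, 2))
y = rng.standard_normal(5)
U, V = rng.standard_normal((2, 3, 4, 2))
lhs = np.linalg.norm(grad_w(U, 0.1, X, y) - grad_w(V, 0.1, X, y))
assert lhs <= lipschitz_constant(X) * np.linalg.norm(U - V)
```

Because the constant depends only on the training tensors, it can be computed once before the APG iterations start.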
Based on the work of Nesterov [13,14], Toh and Yun [8], Ji and Ye [9], and Liu et al. [10]
showed that introducing a search point sequence $\mathcal{Z}_k = \mathcal{W}_k + \frac{t_{k-1} - 1}{t_k} (\mathcal{W}_k - \mathcal{W}_{k-1})$ for a sequence
$t_k$ satisfying $t_{k+1}^2 - t_{k+1} \le t_k^2$ results in a convergence rate of $O(1/k^2)$. Based on their results, we
adapt the APG algorithm to the tensor classification case and summarize it in Algorithm 1.
In this algorithm, the minimization in line 2 is not explicit. In the next section we introduce
methods to solve this problem.
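A common choice that satisfies the condition on the momentum sequence with equality is $\alpha_1 = 1$, $\alpha_{k+1} = (1 + \sqrt{1 + 4\alpha_k^2})/2$, which appears to be the recursion used in Algorithm 1. The following sketch, with hypothetical function names, generates this sequence and forms the corresponding search point.

```python
import math

def momentum_sequence(num_steps):
    """alpha_1 = 1 and alpha_{k+1} = (1 + sqrt(1 + 4 * alpha_k^2)) / 2."""
    alphas = [1.0]
    for _ in range(num_steps - 1):
        alphas.append((1.0 + math.sqrt(1.0 + 4.0 * alphas[-1] ** 2)) / 2.0)
    return alphas

def search_point(w_k, w_prev, alpha_k, alpha_next):
    """Z_{k+1} = W_k + ((alpha_k - 1) / alpha_{k+1}) * (W_k - W_{k-1}); works on floats or arrays."""
    return w_k + ((alpha_k - 1.0) / alpha_next) * (w_k - w_prev)
```

Since $\alpha_1 = 1$, the first search point carries no momentum, and the extrapolation weight $(\alpha_k - 1)/\alpha_{k+1}$ then grows towards 1 as $k$ increases.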

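Once the weight tensor is obtained, the bias $b$ can be derived by solving the following
problem with the weight tensor fixed:

$$b_k = \operatorname{argmin}_{b} \sum_{i=1}^{s} (y_i - \langle \mathcal{W}_k, \mathcal{X}_i \rangle - b)^2 + \lambda \|\mathcal{W}_k\|_*, \qquad (12)$$

which results in the bias updating rule

$$b_k = \frac{1}{s} \sum_{i=1}^{s} (y_i - \langle \mathcal{W}_k, \mathcal{X}_i \rangle). \qquad (13)$$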
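The step from the bias problem to the updating rule is a one-line calculation, since the trace norm term does not depend on $b$. Setting the derivative of the squared loss with respect to $b$ to zero gives:

```latex
\frac{\partial}{\partial b} \sum_{i=1}^{s} \bigl(y_i - \langle \mathcal{W}_k, \mathcal{X}_i \rangle - b\bigr)^2
  = -2 \sum_{i=1}^{s} \bigl(y_i - \langle \mathcal{W}_k, \mathcal{X}_i \rangle - b\bigr) = 0
  \;\Longrightarrow\;
  b = \frac{1}{s} \sum_{i=1}^{s} \bigl(y_i - \langle \mathcal{W}_k, \mathcal{X}_i \rangle\bigr).
```

In other words, the bias is the mean residual of the current linear predictor.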



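Algorithm 1 Weight Tensor Learning via APG

Input: $(\mathcal{X}_i, y_i)$, $i = 1, \ldots, s$.
Initialization: $\mathcal{W}_0 = \mathcal{Z}_1 \in \mathbb{R}^{I_1 \times I_2 \times \cdots \times I_N}$, $\alpha_1 = 1$, $L = 2 \prod_{m=1}^{N} I_m \sum_{i=1}^{s} \|\mathcal{X}_i\|_F^2$, $\lambda$, $k = 1$.
1: while not converged do
2:   $\mathcal{W}_k = \operatorname{argmin}_{\mathcal{W}} \{ \frac{L}{2} \|\mathcal{W} - (\mathcal{Z}_k - \frac{1}{L} \nabla_{\mathcal{W}} f_s(\mathcal{Z}_k, b))\|_F^2 + \lambda \|\mathcal{W}\|_* \}$.
3:   $\alpha_{k+1} = \frac{1 + \sqrt{1 + 4 \alpha_k^2}}{2}$.
4:   $\mathcal{Z}_{k+1} = \mathcal{W}_k + \frac{\alpha_k - 1}{\alpha_{k+1}} (\mathcal{W}_k - \mathcal{W}_{k-1})$.
5:   $k \leftarrow k + 1$.
6: end while
Output: $\mathcal{W} \leftarrow \mathcal{W}_k$.

4 Minimization via Gandy's Algorithms

The problem in Eq. (9), i.e. line 2 of Algorithm 1, has the same form as the recently proposed
tensor completion formulations [7,11]. For tensor completion, Gandy et al. [11] proposed two
algorithms, based on the Douglas-Rachford splitting technique and on the alternating direction
method of multipliers (ADM) respectively. In this work, we adapt these two methods to solve the
problem in Eq. (9).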

Douglas-Rachford splitting technique based method: The Douglas-Rachford splitting technique
has a long history [15,16]. It addresses the minimization of the sum of two functions
$(f + g)(x)$, where $f$ and $g$ are lower semicontinuous convex functions. The Douglas-Rachford
splitting technique asserts that $\mathrm{prox}_{\gamma g}(\hat{x})$ is a minimizer of $(f + g)(x)$, where $\hat{x}$ is the limit
point of the following sequence:

$$x_{n+1} := x_n + t_n \{ \mathrm{prox}_{\gamma f}[2\, \mathrm{prox}_{\gamma g}(x_n) - x_n] - \mathrm{prox}_{\gamma g}(x_n) \}, \qquad (14)$$

where $t_n \in [0, 2]$ satisfies $\sum_{n \ge 0} t_n (2 - t_n) = \infty$, and the proximal map $\mathrm{prox}_{\gamma f}(\cdot)$ is defined as
[17,18]:

$$\mathrm{prox}_{\gamma f} : x \mapsto \operatorname{argmin}_{y} \Big\{ f(y) + \frac{1}{2\gamma} \|x - y\|^2 \Big\}. \qquad (15)$$

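We first formulate the problem in step 2 of Algorithm 1 as the unconstrained minimization of
$(f + g)(x)$. Let $\mathbb{X} := \mathbb{R}^{I_1 \times I_2 \times \cdots \times I_N}$ and define the Hilbert space $\mathcal{H} := \mathbb{X} \times \mathbb{X} \times \cdots \times \mathbb{X}$ ($N+1$ terms)
with the inner product $\langle \mathfrak{X}, \mathfrak{Y} \rangle_{\mathcal{H}} := \frac{1}{N+1} \sum_{i=0}^{N} \langle \mathcal{X}_i, \mathcal{Y}_i \rangle$. Then the problem can be rephrased as:

$$\text{minimize } f(\mathfrak{W}) + g(\mathfrak{W}), \qquad (16)$$

where $\mathfrak{W} = (\mathcal{W}_0, \mathcal{W}_1, \ldots, \mathcal{W}_N)$, $D = \{\mathfrak{W} \in \mathcal{H} \mid \mathcal{W}_0 = \mathcal{W}_1 = \cdots = \mathcal{W}_N\}$, and

$$f(\mathfrak{W}) = \frac{L}{2} \|\mathcal{W}_0 - \mathcal{P}\|_F^2 + \sum_{i=1}^{N} \frac{\lambda}{N} \|\mathcal{W}_{i,(i)}\|_*, \qquad (17)$$

$$g(\mathfrak{W}) = i_D(\mathfrak{W}) = \begin{cases} 0, & \text{if } \mathfrak{W} \in D, \\ +\infty, & \text{otherwise}, \end{cases} \qquad (18)$$

where $\mathcal{P} = \mathcal{Z}_k - \frac{1}{L} \nabla_{\mathcal{W}} f_s(\mathcal{Z}_k, b)$. Then, in order to apply the standard DR splitting technique,
the proximal maps of $f(\mathfrak{W})$ and $g(\mathfrak{W})$ need to be identified.

The proximal map of $f(\mathfrak{W})$ is given by

$$\begin{aligned}
\mathrm{prox}_{\gamma f} \mathfrak{W}
&= \operatorname{argmin}_{\mathfrak{Y} \in \mathcal{H}} \Big\{ \frac{L}{2} \|\mathcal{Y}_0 - \mathcal{P}\|_F^2 + \sum_{i=1}^{N} \frac{\lambda}{N} \|\mathcal{Y}_{i,(i)}\|_* + \frac{1}{2\gamma} \|\mathfrak{Y} - \mathfrak{W}\|_{\mathcal{H}}^2 \Big\} \\
&= \operatorname{argmin}_{\mathfrak{Y} \in \mathcal{H}} \Big\{ \frac{L}{2} \|\mathcal{Y}_0 - \mathcal{P}\|_F^2 + \sum_{i=1}^{N} \frac{\lambda}{N} \|\mathcal{Y}_{i,(i)}\|_* + \frac{1}{2(N+1)\gamma} \sum_{i=0}^{N} \|\mathcal{Y}_i - \mathcal{W}_i\|_F^2 \Big\} \\
&= \Big( \mathrm{prox}_{(N+1)\gamma (\frac{L}{2} \|\cdot - \mathcal{P}\|_F^2)} \mathcal{W}_0,\; \mathrm{prox}_{(N+1)\gamma (\frac{\lambda}{N} \|\cdot_{(1)}\|_*)} \mathcal{W}_1,\; \ldots,\; \mathrm{prox}_{(N+1)\gamma (\frac{\lambda}{N} \|\cdot_{(N)}\|_*)} \mathcal{W}_N \Big).
\end{aligned}$$

For the first component, $\mathrm{prox}_{(N+1)\gamma (\frac{L}{2} \|\cdot - \mathcal{P}\|_F^2)} \mathcal{W}_0$, we have

$$\operatorname{argmin}_{\mathcal{Y}} \Big\{ \frac{L}{2} \|\mathcal{Y} - \mathcal{P}\|_F^2 + \frac{1}{2(N+1)\gamma} \|\mathcal{Y} - \mathcal{W}_0\|_F^2 \Big\} = \frac{L \mathcal{P} + \frac{1}{(N+1)\gamma} \mathcal{W}_0}{L + \frac{1}{(N+1)\gamma}}. \qquad (19)$$

For the remaining components, $\mathrm{prox}_{(N+1)\gamma (\frac{\lambda}{N} \|\cdot_{(i)}\|_*)} \mathcal{W}_i$, $i = 1, \ldots, N$, by Theorem 3.1 in [9], we have

$$\operatorname{argmin}_{\mathcal{Y}} \Big\{ \frac{\lambda}{N} \|\mathcal{Y}_{(i)}\|_* + \frac{1}{2(N+1)\gamma} \|\mathcal{Y} - \mathcal{W}_i\|_F^2 \Big\} = \mathrm{refold}\big(U\, S_{\lambda (N+1) \gamma / N}[\Sigma]\, V^T\big), \qquad (20)$$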

where $U \Sigma V^T$ is the SVD of $\mathcal{W}_{i,(i)}$, refold(·) is defined in Section 2, and $S_\epsilon[\cdot]$ is the
soft-thresholding operator introduced in [19]:

$$S_\epsilon[x] = \begin{cases} x - \epsilon, & \text{if } x > \epsilon, \\ x + \epsilon, & \text{if } x < -\epsilon, \\ 0, & \text{otherwise}, \end{cases} \qquad (21)$$

where $x \in \mathbb{R}$ and $\epsilon > 0$. For vectors and matrices, this operator is extended by applying it
element-wise.
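The soft-thresholding operator and the resulting singular value shrinkage can be written in a few lines of NumPy. This is a generic sketch with names chosen for illustration; `svt(M, tau)` solves the matrix problem $\operatorname{argmin}_Y \tau \|Y\|_* + \frac{1}{2} \|Y - M\|_F^2$, so in the tensor setting it would be applied to a mode-$i$ unfolding with the appropriate threshold.

```python
import numpy as np

def soft_threshold(x, eps):
    """Elementwise S_eps[x]: shrink towards zero by eps, zeroing entries with |x| <= eps."""
    return np.sign(x) * np.maximum(np.abs(x) - eps, 0.0)

def svt(M, tau):
    """Singular value thresholding: soft-threshold the singular values of M by tau."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return (U * soft_threshold(s, tau)) @ Vt
```

For instance, `soft_threshold(np.array([3.0, -3.0, 0.5]), 1.0)` gives `[2.0, -2.0, 0.0]`, and `svt` applies the same rule to the singular values while keeping the singular vectors.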

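The proximal map of the indicator function $g(\mathfrak{W})$ is simply given by

$$\mathrm{prox}_{\gamma g} \mathfrak{W} = (\bar{\mathcal{W}}, \ldots, \bar{\mathcal{W}}),$$

where $\bar{\mathcal{W}} = \frac{1}{N+1} \sum_{i=0}^{N} \mathcal{W}_i$.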

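Now, applying Eq. (14) with $t_n = 1$, we obtain the iteration rules for the original problem:

$$\mathcal{W}_0^{k+1} = \mathcal{W}_0^{k} + \operatorname{argmin}_{\mathcal{W}} \Big( \frac{L}{2} \|\mathcal{W} - \mathcal{P}\|_F^2 + \frac{1}{2(N+1)\gamma} \|\mathcal{W} - (2 \bar{\mathcal{W}}^{k} - \mathcal{W}_0^{k})\|_F^2 \Big) - \bar{\mathcal{W}}^{k}, \qquad (22)$$

$$\mathcal{W}_i^{k+1} = \mathcal{W}_i^{k} + \operatorname{argmin}_{\mathcal{W}} \Big( \frac{\lambda}{N} \|\mathcal{W}_{(i)}\|_* + \frac{1}{2(N+1)\gamma} \|\mathcal{W} - (2 \bar{\mathcal{W}}^{k} - \mathcal{W}_i^{k})\|_F^2 \Big) - \bar{\mathcal{W}}^{k}, \quad i = 1, \ldots, N. \qquad (23)$$

The convergence is guaranteed by Theorem 4.1 in [11]. On convergence, the weight tensor is
$\bar{\mathcal{W}}$.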
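The Douglas-Rachford iteration described above can be sketched in NumPy as follows. This is an illustrative reading of the method rather than the authors' code: it assumes $t_n = 1$, a regularizer of $\lambda/N$ per unfolding (so the shrinkage level is $\lambda (N+1) \gamma / N$), a fixed iteration count instead of a convergence test, and hypothetical helper names.

```python
import numpy as np

def unfold(X, n):
    return np.reshape(np.moveaxis(X, n, 0), (X.shape[n], -1), order="F")

def refold(M, n, shape):
    rest = [d for k, d in enumerate(shape) if k != n]
    return np.moveaxis(np.reshape(M, [shape[n]] + rest, order="F"), 0, n)

def svt(M, tau):
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return (U * np.maximum(s - tau, 0.0)) @ Vt

def dr_weight_update(P, L, lam, gamma, iters=500):
    """Approximately solve argmin_W (L/2)||W - P||_F^2 + lam * ||W||_* by DR splitting."""
    N = P.ndim
    c = 1.0 / ((N + 1) * gamma)            # weight of the quadratic coupling term
    tau = lam * (N + 1) * gamma / N        # shrinkage level for each unfolding
    Ws = [P.copy() for _ in range(N + 1)]  # the copies W_0, ..., W_N
    for _ in range(iters):
        W_bar = sum(Ws) / (N + 1)          # prox of the indicator: average of the copies
        # Component 0: closed-form quadratic prox evaluated at the reflected point.
        new = [Ws[0] + (L * P + c * (2 * W_bar - Ws[0])) / (L + c) - W_bar]
        for i in range(1, N + 1):          # components 1..N: shrink one unfolding each
            R = 2 * W_bar - Ws[i]
            new.append(Ws[i] + refold(svt(unfold(R, i - 1), tau), i - 1, P.shape) - W_bar)
        Ws = new
    return sum(Ws) / (N + 1)
```

A useful sanity check is the matrix case $N = 2$, where the tensor trace norm equals the matrix trace norm and the result should agree with the closed-form shrinkage `svt(P, lam / L)`.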

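ADM based method: The ADM based method dates back to the last century [20]. The approach
consists of iteratively updating the original variables and then carrying out the update of the
dual variables. Each update involves a single variable and is conditioned on the fixed values of
the others. In order to use ADM in tensor completion, Gandy et al. introduced $N$ new tensor-valued
variables that represent the $N$ different mode-$n$ unfoldings of the original tensor, then formed
the augmented Lagrangian and updated all the variables one at a time. Following this method,
we introduce $N$ new variables $\mathcal{Y}_1, \ldots, \mathcal{Y}_N \in \mathbb{R}^{I_1 \times I_2 \times \cdots \times I_N}$ and rephrase line 2 of Algorithm 1 as

$$\begin{aligned}
\min \quad & \frac{L}{2} \|\mathcal{W} - \mathcal{P}\|_F^2 + \frac{\lambda}{N} \sum_{i=1}^{N} \|\mathcal{Y}_{i,(i)}\|_* \\
\text{subject to} \quad & \mathcal{Y}_i = \mathcal{W} \quad \forall i \in \{1, \ldots, N\}.
\end{aligned} \qquad (24)$$

Let $f(\mathcal{W}) = \frac{L}{2} \|\mathcal{W} - \mathcal{P}\|_F^2$ and $g(\mathfrak{Y}) = \frac{\lambda}{N} \sum_{i=1}^{N} \|\mathcal{Y}_{i,(i)}\|_*$, where $\mathfrak{Y} = (\mathcal{Y}_1, \ldots, \mathcal{Y}_N)^T$. Thus the
constraint over $\mathfrak{Y}$ and $\mathcal{W}$ is $\mathfrak{Y} = (\mathcal{W}, \ldots, \mathcal{W})^T$. Then the augmented Lagrangian of Eq. (24)
becomes

$$\mathcal{L}_A(\mathcal{W}, \mathfrak{Y}, \mathfrak{U}) = \frac{L}{2} \|\mathcal{W} - \mathcal{P}\|_F^2 + \sum_{i=1}^{N} \Big( \frac{\lambda}{N} \|\mathcal{Y}_{i,(i)}\|_* - \langle \mathcal{U}_i, \mathcal{W} - \mathcal{Y}_i \rangle + \frac{\beta}{2} \|\mathcal{W} - \mathcal{Y}_i\|_F^2 \Big) \qquad (25)$$

where the parameter $\beta$ is any positive number and $\mathfrak{U} = (\mathcal{U}_1, \ldots, \mathcal{U}_N)^T$ is the Lagrange multiplier.
By minimizing $\mathcal{L}_A(\mathcal{W}, \mathfrak{Y}, \mathfrak{U})$ with respect to each single variable with the other variables fixed, we
obtain the updating rules of all the variables $\mathcal{W}$, $\mathfrak{Y}$, $\mathfrak{U}$:

$$\begin{aligned}
\mathcal{W}^{k+1} &= \Big( L \mathcal{P} + \beta \sum_{i=1}^{N} \mathcal{Y}_i^{k} + \sum_{i=1}^{N} \mathcal{U}_i^{k} \Big) \Big/ (L + N \beta), \\
\mathcal{Y}_i^{k+1} &= \mathrm{refold}\big(U\, S_{\lambda / (N \beta)}[\Sigma]\, V^T\big), \quad i = 1, \ldots, N, \\
\mathcal{U}_i^{k+1} &= \mathcal{U}_i^{k} - \beta (\mathcal{W}^{k+1} - \mathcal{Y}_i^{k+1}), \quad i = 1, \ldots, N,
\end{aligned} \qquad (26)$$

where $U \Sigma V^T$ is the SVD of $(\mathcal{W}^{k+1} - \frac{1}{\beta} \mathcal{U}_i^{k})_{(i)}$.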
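The three ADM updates translate almost line by line into NumPy. The sketch below makes the same assumptions as the derivation above (a regularizer of $\lambda/N$ per unfolding and a multiplier term with a negative sign) and uses a fixed number of iterations; the function names are hypothetical and the helpers are repeated so that the block runs on its own.

```python
import numpy as np

def unfold(X, n):
    return np.reshape(np.moveaxis(X, n, 0), (X.shape[n], -1), order="F")

def refold(M, n, shape):
    rest = [d for k, d in enumerate(shape) if k != n]
    return np.moveaxis(np.reshape(M, [shape[n]] + rest, order="F"), 0, n)

def svt(M, tau):
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return (U * np.maximum(s - tau, 0.0)) @ Vt

def adm_weight_update(P, L, lam, beta, iters=300):
    """Approximately solve argmin_W (L/2)||W - P||_F^2 + lam * ||W||_* by ADM."""
    N = P.ndim
    Ys = [P.copy() for _ in range(N)]          # auxiliary copies Y_1, ..., Y_N
    Us = [np.zeros_like(P) for _ in range(N)]  # Lagrange multipliers U_1, ..., U_N
    W = P.copy()
    for _ in range(iters):
        # W-update: minimizer of a sum of quadratics.
        W = (L * P + beta * sum(Ys) + sum(Us)) / (L + N * beta)
        for i in range(N):
            # Y_i-update: singular value shrinkage of the mode-i unfolding.
            M = unfold(W - Us[i] / beta, i)
            Ys[i] = refold(svt(M, lam / (N * beta)), i, P.shape)
            # Multiplier update for the constraint Y_i = W.
            Us[i] = Us[i] - beta * (W - Ys[i])
    return W
```

As with the Douglas-Rachford variant, the matrix case offers a simple check: for a matrix `P` the output should match `svt(P, lam / L)`.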

So far we have proposed two methods to solve the tensor classification problem. In the
next section, we discuss the online implementation of the proposed learning processes.



5 Online Learning 

The methods proposed above are iterative batch procedures that access the whole training set
at each iteration in order to minimize a weighted sum of a cost function and the tensor trace
norm. This kind of learning procedure cannot handle very large training sets, since the data
may not fit into memory at once. Furthermore, it cannot be started until all the training data
are prepared, and hence cannot deal effectively with training data that arrive sequentially,
as in audio and video processing.

To address these problems, we propose an online approach that processes the training samples
one at a time, or in mini-batches, to learn the weight tensor and the bias for tensor
classification. We transform the above algorithm into an online learning framework. The framework
is described in Algorithm 2, which also includes the bias updating steps.

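The $\otimes$ operator in line 6 of Algorithm 2 denotes the Kronecker product of tensors, which is
analogous to the matrix Kronecker product. Given two tensors $\mathcal{A} \in \mathbb{R}^{I_1 \times \cdots \times I_N}$ and
$\mathcal{B} \in \mathbb{R}^{J_1 \times \cdots \times J_N}$ of equal order $N$, $\mathcal{A} \otimes \mathcal{B}$ denotes the Kronecker product between $\mathcal{A}$ and $\mathcal{B}$,
which results in a tensor in $\mathbb{R}^{I_1 J_1 \times \cdots \times I_N J_N}$ defined by blocks of size $J_1 \times \cdots \times J_N$ equal to
$a_{i_1 \ldots i_N} \mathcal{B}$. GridTr($\mathcal{W}$, $\mathcal{B}_t$) in line 13 denotes an operator with inputs $\mathcal{W} \in \mathbb{R}^{I_1 \times \cdots \times I_N}$ and
$\mathcal{B}_t \in \mathbb{R}^{I_1 J_1 \times \cdots \times I_N J_N}$ (here with $J_m = I_m$) that results in a tensor in $\mathbb{R}^{I_1 \times \cdots \times I_N}$ whose
$(i_1, \ldots, i_N)$th element is the inner product between $\mathcal{W}$ and the $(i_1, \ldots, i_N)$th
$\mathbb{R}^{I_1 \times \cdots \times I_N}$ block of $\mathcal{B}_t$.

Algorithm 2 Online learning for tensor classification via APG

Initialization: $\mathcal{W}_0 \in \mathbb{R}^{I_1 \times I_2 \times \cdots \times I_N}$, $b_0 \in \mathbb{R}$, $\lambda$.
1: $\mathcal{A}_0 \in \mathbb{R}^{I_1 \times \cdots \times I_N} \leftarrow 0$, $\mathcal{B}_0 \in \mathbb{R}^{I_1 I_1 \times \cdots \times I_N I_N} \leftarrow 0$, $c_0 \in \mathbb{R} \leftarrow 0$, $\mathcal{D}_0 \in \mathbb{R}^{I_1 \times \cdots \times I_N} \leftarrow 0$,
   $L_0 \in \mathbb{R} \leftarrow 0$ (reset the "past" information).
2: for $t = 1$ to $T$ do
3:   Draw training sample $(\mathcal{X}_t, y_t)$ from $p(\mathcal{X}, y)$.
4:   // Lines 5-9 update the "past" information.
5:   $\mathcal{A}_t \leftarrow \mathcal{A}_{t-1} + y_t \mathcal{X}_t$;
6:   $\mathcal{B}_t \leftarrow \mathcal{B}_{t-1} + \mathcal{X}_t \otimes \mathcal{X}_t$;
7:   $c_t \leftarrow c_{t-1} + y_t$;
8:   $\mathcal{D}_t \leftarrow \mathcal{D}_{t-1} + \mathcal{X}_t$;
9:   $L_t \leftarrow L_{t-1} + 2 \prod_{m=1}^{N} I_m \|\mathcal{X}_t\|_F^2$.
10:  // Lines 11-19 compute $\mathcal{W}_t$ using the APG method, with $\mathcal{W}_{t-1}$ as warm restart.
11:  $\mathcal{W}_{0,t} = \mathcal{Z}_{1,t} = \mathcal{W}_{t-1} \in \mathbb{R}^{I_1 \times I_2 \times \cdots \times I_N}$, $b_{0,t} = b_{t-1}$, $\alpha_1 = 1$, $k = 1$.
12:  while not converged do
13:    $\mathcal{W}_{k,t} = \operatorname{argmin}_{\mathcal{W}} \frac{L_t}{2} \|\mathcal{W} - (\mathcal{Z}_{k,t} + \frac{2}{L_t} (\mathcal{A}_t - \mathrm{GridTr}(\mathcal{Z}_{k,t}, \mathcal{B}_t) - b_{k-1,t} \mathcal{D}_t))\|_F^2 + \lambda \|\mathcal{W}\|_*$.
14:    $\alpha_{k+1} = \frac{1 + \sqrt{1 + 4 \alpha_k^2}}{2}$.
15:    $\mathcal{Z}_{k+1,t} = \mathcal{W}_{k,t} + \frac{\alpha_k - 1}{\alpha_{k+1}} (\mathcal{W}_{k,t} - \mathcal{W}_{k-1,t})$.
16:    $b_{k,t} = \frac{1}{t} (c_t - \langle \mathcal{W}_{k,t}, \mathcal{D}_t \rangle)$.
17:    $k \leftarrow k + 1$.
18:  end while
19:  $\mathcal{W}_t \leftarrow \mathcal{W}_{k,t}$, $b_t \leftarrow b_{k,t}$.
20: end for
Output: $\mathcal{W} \leftarrow \mathcal{W}_T$, $b \leftarrow b_T$.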
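The tensor Kronecker product and the GridTr operator used in Algorithm 2 can be sketched in NumPy as below. The names `tensor_kron` and `grid_tr` are chosen for this example, and the block layout (index $i_m J_m + j_m$ along mode $m$, 0-based) is an assumption that matches `np.kron` in the matrix case.

```python
import numpy as np

def tensor_kron(A, B):
    """Kronecker product of two equal-order tensors: block (i_1..i_N) equals a_{i_1..i_N} * B."""
    N = A.ndim
    outer = np.multiply.outer(A, B)                       # axes: (i_1..i_N, j_1..j_N)
    order = [ax for m in range(N) for ax in (m, N + m)]   # interleave to (i_1, j_1, ..., i_N, j_N)
    shape = [A.shape[m] * B.shape[m] for m in range(N)]
    return outer.transpose(order).reshape(shape)

def grid_tr(W, Bt):
    """GridTr: inner product of W with every W-shaped block of Bt."""
    N = W.ndim
    grid = [Bt.shape[m] // W.shape[m] for m in range(N)]
    split = [d for m in range(N) for d in (grid[m], W.shape[m])]
    blocks = Bt.reshape(split)                            # axes: (i_1, j_1, ..., i_N, j_N)
    block_axes = list(range(1, 2 * N, 2))                 # the within-block axes j_1..j_N
    return np.tensordot(blocks, W, axes=(block_axes, list(range(N))))
```

With these definitions, `grid_tr(Z, B_t)` for `B_t` equal to the sum of `tensor_kron(X_i, X_i)` reproduces the sum of $\langle \mathcal{Z}, \mathcal{X}_i \rangle \mathcal{X}_i$, which is why line 13 can evaluate the gradient from the accumulated statistics without revisiting past samples.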



Assuming that the training set is composed of i.i.d. samples from a distribution $p(\mathcal{X}, y)$, the
outer loop draws one training sample $(\mathcal{X}_t, y_t)$ at a time. This sample is first used to update the
"past" information $\mathcal{A}_{t-1}$, $\mathcal{B}_{t-1}$, $c_{t-1}$, $\mathcal{D}_{t-1}$, $L_{t-1}$. Then Algorithm 1 is applied to update the
weight tensor, with the $\mathcal{W}_{t-1}$ obtained at the previous iteration as a warm start. Since
$F_t(\mathcal{W}, b_{t-1})$ is relatively close to $F_{t-1}(\mathcal{W}, b_{t-1})$ for large values of $t$, so are $\mathcal{W}_t$ and $\mathcal{W}_{t-1}$ under
suitable assumptions, which makes it efficient to use $\mathcal{W}_{t-1}$ as a warm restart for computing $\mathcal{W}_t$.

As stopping criteria for the inner iterations, we use the following relative error conditions:

$$\frac{\|\mathcal{W}_{k+1,t} - \mathcal{W}_{k,t}\|_F}{\|\mathcal{W}_{k,t}\|_F + 1} < \epsilon_1 \quad \text{and} \quad \frac{|b_{k+1,t} - b_{k,t}|}{|b_{k,t}| + 1} < \epsilon_2. \qquad (27)$$

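In some settings, following a classical heuristic in gradient descent algorithms, we may also
improve the convergence speed of our algorithm by drawing $\mu > 1$ training samples at each
iteration instead of a single one. Let us denote by $(\mathcal{X}_{t,1}, y_{t,1}), \ldots, (\mathcal{X}_{t,\mu}, y_{t,\mu})$ the samples drawn
at iteration $t$. We can now replace lines 5 to 9 of Algorithm 2 by

$$\begin{aligned}
& \mathcal{A}_t \leftarrow \mathcal{A}_{t-1} + \sum_{i=1}^{\mu} y_{t,i} \mathcal{X}_{t,i}, \quad
\mathcal{B}_t \leftarrow \mathcal{B}_{t-1} + \sum_{i=1}^{\mu} \mathcal{X}_{t,i} \otimes \mathcal{X}_{t,i}, \quad
c_t \leftarrow c_{t-1} + \sum_{i=1}^{\mu} y_{t,i}, \\
& \mathcal{D}_t \leftarrow \mathcal{D}_{t-1} + \sum_{i=1}^{\mu} \mathcal{X}_{t,i}, \quad \text{and} \quad
L_t \leftarrow L_{t-1} + \sum_{i=1}^{\mu} 2 \prod_{m=1}^{N} I_m \|\mathcal{X}_{t,i}\|_F^2.
\end{aligned} \qquad (28)$$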

In real applications, however, this online mini-batch update may not improve the overall
convergence speed, since computing the batch "past" information in Eq. (28) takes up much of the
time. In particular, updating $\mathcal{B}_t$ requires Kronecker products, which consume a large share of
the computing resources. If the computation cost of Eq. (28) can be ignored or largely reduced,
for example by parallel computing, this mini-batch method would increase the convergence speed
by a factor of $\mu$.

6 Experimental Validation 

In this section, we conduct experiments to demonstrate the characteristics of the proposed
methods for the tensor classification problem. Six algorithms are compared: the batch learning
algorithm with APG using the DR method (APG_DR); the online learning algorithm with APG using
DR (OL_APG_DR); the batch learning algorithm with APG using the ADM method (APG_ADM); the
online learning algorithm with APG using ADM (OL_APG_ADM); OL_APG_DR with the update in
Eq. (28) (OL_APG_DR_miniBatch); and OL_APG_ADM with the update in Eq. (28)
(OL_APG_ADM_miniBatch). All algorithms are run in Matlab on a PC with an Intel 2.53GHz dual-core
CPU and 3.25GB of memory.

For our experiments, we use $2.4 \times 10^5$ randomly generated third-order tensors of size
$10 \times 10 \times 10$ with varied ranks (note that the rank here is not the $n$-rank mentioned above,
but the rank associated with the CANDECOMP/PARAFAC decomposition; see [1] for the exact
definition); $2 \times 10^5$ of these are kept for training, and the rest for testing. The goal is to
classify the tensors according to their ranks. Hence we cast the tensor rank identification
problem as a classification (or regression) problem. We generate a rank-$r$ tensor as
a sum of $r$ rank-one tensors, where each rank-one tensor is an outer product of 3 vectors whose
elements are drawn i.i.d. from the standard uniform distribution on the open interval (0, 1). For
all the algorithms, the parameters in the stopping criteria (27) are $\epsilon_1 = 10^{-10}$ and $\epsilon_2 = 10^{-10}$.
The regularization constant $\lambda$ is anchored by the large explicit fixed step size $L$ and the tensors
involved, which means that in practice $\lambda$ should be set adaptively with the step size
$L$ in the online process. However, varying $\lambda$ in this way would make the comparison between the
algorithms meaningless. Hence in this work we use $\lambda = 1$ throughout. Considering a balance
between convergence speed and accuracy, we set $\beta = 10^{7}$, $\gamma = 10^{-7}$ in this work.

Figure 1 compares all the algorithms proposed in this work. The batch algorithms use a
training set of $2 \times 10^3$ training samples, while the online algorithms draw samples from the
entire training set. We use a logarithmic scale for the computation time. Figure 1(a) shows
the mean square tensor rank prediction errors as functions of time. In general, all
methods converge. Among them, the ADM based methods converge faster than the
DR based methods. The batch learning methods converge faster than the corresponding online
learning methods, with or without mini-batch updating of the "past" information. It can also be
seen that when the size of the mini-batch used in the online methods increases, the speed of
convergence decreases; the reason for this has been explained in the last paragraph of Section 5.
After all the methods converge, they yield almost equal performance. Figure 1(b) shows the
classification rates with tensor rank estimation error tolerance $\eta = 1$. Here the rank estimation
error tolerance means that if the distance between the estimated rank value and the real rank
value is less than $\eta$, then the tensor is counted as correctly classified. The convergence of the
classification accuracies corresponds to the convergence of the mean square tensor rank
prediction errors. With an error tolerance $\eta = 1$, the methods yield a classification rate of
95.9%.

Figure 1: Comparison between various learning methods; results are reported as functions
of learning time on a logarithmic scale.

7 Conclusions 

In this paper, we have proposed methods to solve the tensor classification problem with tensor
trace norm regularization. We employed the APG method to learn the parameters, with DR and ADM
used to update the weight tensor. We also gave online learning implementations for all the
proposed methods. In addition, for the standard squared loss function, we derived the explicit
form of the Lipschitz constant, which removes the computational burden of searching for a step
size. Our empirical study on tensor classification according to tensor rank demonstrates the
merits of the proposed algorithms. This is, to our knowledge, the first work on tensor norm
constrained tensor classification. Some future work is worth considering. First, alternating
between minimization with respect to the weight tensor and the bias may result in fluctuation of
the objective value, so optimization algorithms that minimize jointly over the weight tensor and
the bias are required. Second, for multi-class problems with more classes, hierarchical methods
may be introduced to improve the classification accuracy.